The predictability of letters in written English

Authors

  • Thomas Schürmann
  • Peter Grassberger
Abstract

We show that the predictability of letters in written English texts depends strongly on their position in the word. The first letters are usually the least easy to predict. This agrees with the intuitive notion that words are well defined subunits in written languages, with much weaker correlations across these units than within them. It implies that the average entropy of a letter deep inside a word is roughly 4-5 times smaller than the entropy of the first letter.

PACS: 89.70.+c, 02.50.Fz, 05.45.Tp

Since language is used to transmit information, one of its most quantitative characteristics is the entropy, i.e., the average amount of information (usually measured in bits) per character. Entropy as a measure of information was introduced by Shannon [1]. He also performed extensive experiments [2] using the ability of humans to predict continuations of printed text. This and similar experiments [3][4] led to estimates of typically ≈ 1 to 1.5 bits per character. In contrast, the best computer algorithms whose prediction is based on sophisticated statistical methods reach entropies of ≈ 2 to 2.4 bits [5]. Even this is better than what commercial text compression packages achieve: starting from texts where each character is represented by one byte, they typically achieve compression ratios ≈ 2, corresponding to ≈ 4 bits/character. These differences result from different abilities to take into account long-range correlations which are present in all texts and whose utilization requires not only a good understanding of language but also substantial computational resources.

Formally, the Shannon entropy $h$ of a letter sequence $(\ldots, s_{-1}, s_0, s_1, \ldots)$ over an alphabet of $d$ letters is given by

$$h = -\lim_{n\to\infty} \sum_{s_{-n},\ldots,s_0} p(s_{-n},\ldots,s_0)\,\log p(s_0 \mid s_{-1},\ldots,s_{-n}) \tag{1}$$

$$\phantom{h} = \lim_{n\to\infty} \left\langle -\log p(s_0 \mid s_{-1},\ldots,s_{-n}) \right\rangle \tag{2}$$

where $p(s_{-n},\ldots,s_0)$ is the probability for the letters at positions $-n$ to $0$ to be $s_{-n}$ to $s_0$, and

$$p(s_0 \mid s_{-1},\ldots,s_{-n}) = \frac{p(s_{-n},\ldots,s_0)}{p(s_{-n},\ldots,s_{-1})}.$$
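As a rough illustration of Eqs. (1) and (2) at a fixed, finite n, the following Python sketch estimates the conditional entropy from n-gram counts in a sample string. The function name `conditional_entropy`, the toy sample text, and the naive plug-in (relative-frequency) estimator are assumptions made here for illustration; this is not the estimator used in the paper, and such naive counts become strongly biased once n grows beyond what the text length supports. Logarithms are taken to base 2 so that the result is in bits per character.

```python
from collections import Counter
from math import log2


def conditional_entropy(text: str, n: int) -> float:
    """Plug-in estimate of <-log2 p(s0 | s-1, ..., s-n)> in bits per character.

    Illustrative only: probabilities are relative frequencies of the
    (n+1)-grams found in `text`, with no bias correction.
    """
    if n < 0 or len(text) <= n:
        raise ValueError("need n >= 0 and a text longer than n characters")

    # Count every (n+1)-gram: n context letters followed by the predicted letter.
    joint = Counter(text[i:i + n + 1] for i in range(len(text) - n))

    # Context counts are derived from the same (n+1)-grams, so the
    # conditional frequencies for each context sum to one.
    context = Counter()
    for gram, count in joint.items():
        context[gram[:n]] += count

    total = sum(joint.values())
    h = 0.0
    for gram, count in joint.items():
        p_joint = count / total               # p(s-n, ..., s0)
        p_cond = count / context[gram[:n]]    # p(s0 | s-1, ..., s-n)
        h -= p_joint * log2(p_cond)
    return h


if __name__ == "__main__":
    # Hypothetical toy input; a real estimate needs a much longer text.
    sample = "the quick brown fox jumps over the lazy dog and the other dog"
    for order in range(3):
        print(order, conditional_entropy(sample, order))
```

For n = 0 the function reduces to the entropy of the single-letter frequencies; increasing n can only lower the plug-in estimate, partly because of genuine correlations and partly because of undersampling.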
The second line, Eq. (2), tells us that $h$ can be considered as an average over the information carried by each individual letter. While Eq. (1) obviously assumes stationarity, the information of a single letter can also be defined for nonstationary sequences, provided they are distributed according to some probability $p$ which satisfies the Kolmogorov consistency conditions. The information of the $k$th letter when it follows the string $\ldots, s_{k-2}, s_{k-1}$ is thus defined as:

Journal:
  • CoRR

Volume abs/0710.4516

Publication date 2007